Abstract

RNA backbone conformation analysis has been demonstrated to be particularly difficult due to the large number of torsion angles per residue and the large variability of the raw data. Due in part to the importance of local structures in the understanding of RNA catalysis and binding functions, studies in this area have recently received increased attention. In this work we use classical tools from statistics and signal processing to search for clusters in the RNA backbone torsion angles. Results are reported both for scalar studies, where each torsion angle is studied separately, and for vectorial studies, where several angles are clustered simultaneously. Using techniques from optimal quantization, we automatically find the torsion angle clusters. With these clustering techniques, we find RNA backbone motifs under both single-residue (phosphate-to-phosphate) parsing and suite (base-to-base) parsing. These two parsing techniques are also compared using mutual information measurements. We conclude the work with a statistical analysis of some of these motifs and an optimal fitting of the torsion angle distributions in the most significant clusters. The whole process is fully automatic and based on well-defined optimality criteria.


1  Introduction 

RNA plays an important role in the storage and communication of information, as well as in other important biological processes. As with proteins, the 3D structure of RNA is essential for performing these functions. The 3D structure of RNA differs from that of proteins, with six torsion angles in each residue; see Figure 1.

The work described here follows recent efforts in studying the local 3D structure of RNA, e.g., [5, 9, 10, 11]. In this paper we use classical techniques from statistical signal processing to study the RNA torsion angles, which are illustrated in Figure 1; see also [15]. We present fully automatic techniques to search for motifs (conformers/rotamers) in the RNA backbone, both at the level of individual residues or suites and at the level of a group of consecutive ones. Note that in [5], we considered the problem of finding repeating conformational states (conformational motifs) and representing them as repeating strings of ASCII characters. The use of quantization makes the recent approaches of [5, 9] fully automatic and based on well-defined distortion and quality metrics.¹ Additional statistical analysis techniques demonstrated in this paper are mutual information to compare residue and suite parsing, optimal fitting of the main torsion angle clusters, and principal component analysis of key motifs found.

2  Scalar  and  Vector  Quantization 

In this section, we briefly describe the basic concepts of vector quantization that we will use for clustering. Details on this technique can be found, e.g., in [2], from which we have prepared the summary presented here. Note that in this work we restrict ourselves to this clustering technique, while in the future we plan to use more advanced ones such as those reported in [12].²

Vector quantization (VQ) is a clustering technique originally developed for lossy data compression. In 1980, Linde et al. [8] proposed a practical VQ design algorithm based on a training sequence. The use of a training sequence bypasses the need for multi-dimensional integration, thereby making VQ a practical technique that is implemented in most scientific computation packages, such as Matlab (www.mathworks.com).

A VQ is nothing more than an approximator. The idea is similar to that of “rounding off” (say, to the nearest integer). An example of a one-dimensional VQ is shown in Figure 2. Here, every number less than -2 is approximated by -3, every number between -2 and 0 is approximated by -1, every number between 0 and 2 is approximated by +1, and every number greater than 2 is approximated by +3. Figure 2 also presents a two-dimensional example. Here, every pair of numbers falling in a particular region is approximated by the red star associated with that region.

¹ Vector quantization has been used in the context of protein structure; see, e.g., [6].

² We should also note that vector quantization is often known in the literature as k-means clustering.

Figure 1: RNA backbone with six torsion angles labeled on the central bond of the four atoms defining each dihedral. The two alternative ways of parsing out a repeat are indicated: a traditional nucleotide residue goes from phosphate to phosphate (changing residue number between O5' and P), whereas an RNA suite, which is more appropriate for local geometry analysis, goes from sugar to sugar (or base to base). Only the angles α, γ, δ, and ζ are investigated in this study. This image was obtained from [9], to which the reader is directed for a detailed description of the reasons for using both parsing approaches.

Figure 2: One-dimensional (top) and two-dimensional (bottom) examples of clustering via (vector) quantization. All the points in a given interval (in one dimension) or a given cell (in two dimensions) are represented by the red marked “center.” (This is a color figure.)
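The one-dimensional quantizer described for Figure 2 can be written down in a few lines. The following is only a minimal sketch: the names `BOUNDARIES`, `LEVELS`, and `quantize` are hypothetical, and the treatment of a value lying exactly on a boundary (assigned here to the upper cell) is an assumption, since the text does not specify it.

```python
from bisect import bisect_right

# Decision boundaries and reproduction levels of the 1-D example in the text.
BOUNDARIES = [-2.0, 0.0, 2.0]
LEVELS = [-3.0, -1.0, 1.0, 3.0]


def quantize(x):
    """Map a real number to the reproduction level of the cell containing it."""
    # bisect_right counts the boundaries that are <= x, which is the cell index.
    # Assumption: a value exactly on a boundary goes to the upper cell.
    return LEVELS[bisect_right(BOUNDARIES, x)]


samples = [-2.7, -0.4, 1.9, 5.0]
approximations = [quantize(x) for x in samples]
```

Each sample is replaced by the level of its interval, which is the scalar counterpart of replacing a point by the “center” of its cell.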

The VQ design problem can be stated as follows. Given a vector source with known statistical properties, a distortion measure, and the number of desired codevectors, find a codebook (the set of all red stars) and a partition (the set of blue lines) that result in the smallest average distortion.

We assume that there is a training sequence (e.g., the measured torsion angles in the RNA backbone) consisting of $M$ source vectors, $T = \{x_1, x_2, \ldots, x_M\}$. We assume that the source vectors are $k$-dimensional, e.g., $x_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,k})$ for $1 \le m \le M$.

Let $N$ be the number of desired codevectors and let $C = \{c_1, c_2, \ldots, c_N\}$ be the codebook, where each $c_n$, $1 \le n \le N$, is of course $k$-dimensional as well. Let $S_n$ be the cell associated with the codevector $c_n$ and let $P = \{S_1, S_2, \ldots, S_N\}$ be the corresponding partition of the $k$-dimensional space. If the source vector $x_m$ is in the encoding region $S_n$, then it is approximated by $c_n$; we denote this map by $Q(x_m) = c_n$ (if $x_m \in S_n$). Then, assuming for example a squared error distortion measure, the average distortion is given by

$$D = \frac{1}{M} \sum_{m=1}^{M} \| x_m - Q(x_m) \|^2,$$

where $\| e \|^2 = e_1^2 + e_2^2 + \cdots + e_k^2$.

The design problem then becomes the following: Given the training data set $T$ and the number of desired codevectors (or clusters) $N$, find the cluster centers $C$ and the space partition $P$ such that the distortion $D$ is minimized. This problem can be efficiently solved with the LBG algorithm [4, 8], and, as mentioned above, its implementation can be found in most of the popular scientific computing programs.
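To make the design problem concrete, the sketch below implements the alternating partition and centroid steps (the Lloyd iteration) that lie at the core of LBG-type designs. It is an illustration under stated assumptions rather than the implementation used for the results in this paper: the initial codebook is supplied by hand instead of being produced by a splitting procedure, the function names are hypothetical, and the training vectors are toy values, not torsion angles.

```python
def squared_error(x, c):
    """Squared Euclidean distance ||x - c||^2 between two k-dimensional vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def nearest(x, codebook):
    """Index n of the codevector closest to x, so that Q(x) = codebook[n]."""
    return min(range(len(codebook)), key=lambda n: squared_error(x, codebook[n]))


def average_distortion(training, codebook):
    """D = (1/M) * sum over the training set of ||x_m - Q(x_m)||^2."""
    total = sum(squared_error(x, codebook[nearest(x, codebook)]) for x in training)
    return total / len(training)


def lloyd(training, initial_codebook, iterations=50):
    """Alternate the partition step and the centroid step a fixed number of times."""
    codebook = [list(c) for c in initial_codebook]
    for _ in range(iterations):
        # Partition step: each training vector joins the cell of its nearest codevector.
        cells = [[] for _ in codebook]
        for x in training:
            cells[nearest(x, codebook)].append(x)
        # Centroid step: each codevector moves to the mean of its cell.
        for n, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codevector
                codebook[n] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


# Toy two-dimensional training sequence with two obvious groups (hypothetical data).
training = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0),
            (10.0, 10.0), (10.0, 12.0), (12.0, 10.0)]
codebook = lloyd(training, [(0.0, 0.0), (12.0, 10.0)])
distortion = average_distortion(training, codebook)
```

For this toy input the two codevectors settle at the centroids of the two groups. With real data, the result depends on the initial codebook, which is why practical designs pay attention to initialization.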

3  Clustering  the  RNA  Backbone 
Torsion  Angles 

We first report results from scalar quantization, where each of the angles is studied separately. Once this is done, we will analyze all torsion angles as a vector. We use two data sets. The first follows the work reported in [5] and is for a single RNA with 2914 residues (HM LSU 23S rRNA, rr0033), while the second follows the work reported in [9] and is for a collection of 132 RNAs,³ giving a total of 10463 residues. Here, as in the rest of this work, residues with unknown torsion angles were ignored in the analysis. The data was obtained from the Nucleic Acid Database [13]. Although we have not applied the filtering techniques of [9], these might be used to improve our results. As in [5], we limit the analysis here to the torsion angles α, γ, δ, ζ (see Figure 1), since the other ones either depend on these or have unimodal distributions [14, 16]. Our technique has no intrinsic limitation that restricts it to this reduced set of angles (moreover, since the process is fully automatic, the work can certainly be carried out for larger sets), but the restriction will clarify the presentation.

³ With NDB and PDB codes: ar0001, 02, 04, 05, 06, 07, 08, 09, 11, 12, 13, 20, 21, 22, 23, 24, 27, 28, 30, 32, 36, 38, 40, 44; arb002, 3, 4, 5; arf0108; arh064, 74; arl037, 48, 62; arn035; dr0005, 08, 10; drb002, 03, 05, 07, 08, 18; drd004; pd0345; pr0005, 06, 07, 08, 09, 10, 11, 15, 17, 18, 19, 20, 21, 22, 26, 30, 32, 33, 34, 36, 37, 40, 46, 47, 51, 53, 55, 57, 60, 62, 63, 65, 67, 69, 71, 73, 75, 78, 79, 80, 81, 83, 85, 90, 91; prv001, 04, 10, 20, 21; pte003; ptr004, 16; rr0005, 10, 16, 19, 33; tr0001; trna12; uh0001; uhx026; ur0001, 04, 05, 07, 09, 12, 14, 15, 19, 20, 22, 26; urb003, 08, 16; urc002; urf042; url029, 50; urt068; and urx053, 59, 63, 75.

In Figure 3 we show the distributions of these four angles for the two datasets. A few things are worth noticing. First, the distributions are very similar for both datasets, indicating that the local structures are “rotameric” not only for a given RNA (first dataset) but also across RNAs (second dataset). Second, although the distributions for α and ζ are very similar (since these can be considered analogous angles), the secondary peaks for ζ are much broader and less well defined; see Figure 4. This has been the subject of controversy; for example, the authors of [9] address it by filtering, and then report more clusters than the non-filtered approach in [5]. Still, although this filtering is important in the analysis, it does not explain the unique long tail in the ζ distribution; see also [15]. In particular, note that the rotation of ζ is sterically more restricted than that of α by proximity to the furanose ring. Here, we will limit our analysis (see below) to what the VQ statistical analysis tells us, working with the raw data and without any additional constraints. Understanding this difference between the α and ζ torsion angles is something that intrigues us and that we hope to address in the near future.

Using the automatic and optimal quantization technique, and requesting the number N of codevectors following [5] (or just from visual inspection), we found the codevectors, or cluster centers, given in Table 1.


Angle   Dataset 1                                Dataset 2
α       68.3 (1), 169.7 (2), 294.3 (3)           68.6 (1), 167.8 (2), 294.0 (3)
γ       50.4, 60.0 (1), 175.8 (2), 292.3 (3)     50.1, 65.0 (1), 174.4 (2), 290.2 (3)
δ       81.7 (1), 147.8 (2)                      82.7 (1), 144.4 (2)
ζ       118.0 (2), 286.7 (1)                     116.4 (2), 286.0 (1)

Table 1: Cluster centers (in degrees) automatically computed by our technique. Numbers in parentheses are used for cluster identification.

Figure 3: Cumulative distributions of the torsion angles α, γ, δ, and ζ for the single RNA (first two rows) and the collection of RNAs (last two rows). We observe the similarity among the distributions, marking the presence of “rotamers” not only for a given RNA but also across RNAs. We also observe clear modes, which are automatically detected by the proposed clustering technique. In addition, note that the ζ torsion angle has a large tail not present in the other distributions.

We note once again the very similar results for both datasets. We should also note that for γ, two of the centers are very close to each other, and they will be considered as one when we proceed to cluster the data. Note also that although we have pre-defined the number of clusters, this could also be left as part of the automatic process, for example via the expectation-maximization (EM) algorithm. We have observed that increasing the number of clusters does not produce a significant change in the distortion D, an indication that the selected number of clusters is sufficient. Regarding ζ, if additional clusters are requested, e.g., 3 clusters, for the first dataset these are automatically found at 85.86, 188.25, and 289.27, thereby splitting the large tail (following the directions reported in [9]).
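The observation that the distortion D stops changing significantly once enough clusters are requested can be checked by running the scalar quantizer for several cluster counts. The sketch below does this on synthetic values; the angles are hypothetical numbers loosely placed around three modes, the quantile-based initialization is an assumption, and angles are treated as ordinary real numbers (no wrap-around at 360 degrees).

```python
def lloyd_1d(values, n_clusters, iterations=100):
    """Scalar Lloyd iteration; returns (centers, average squared-error distortion)."""
    data = sorted(values)
    # Hypothetical initialization: centers at evenly spaced quantiles of the data.
    centers = [data[(2 * i + 1) * len(data) // (2 * n_clusters)]
               for i in range(n_clusters)]
    for _ in range(iterations):
        cells = [[] for _ in centers]
        for v in data:
            nearest = min(range(n_clusters), key=lambda i: (v - centers[i]) ** 2)
            cells[nearest].append(v)
        # Each center moves to the mean of its cell; an empty cell keeps its center.
        centers = [sum(cell) / len(cell) if cell else centers[i]
                   for i, cell in enumerate(cells)]
    distortion = sum(min((v - c) ** 2 for c in centers) for v in data) / len(data)
    return centers, distortion


# Synthetic "angles" with three tight modes (illustrative values, not measured data).
angles = [c + d for c in (68.0, 170.0, 294.0) for d in (-4.0, -2.0, 0.0, 2.0, 4.0)]

# Distortion as a function of the requested number of clusters.
curve = {n: lloyd_1d(angles, n)[1] for n in range(1, 6)}
```

On this toy input the distortion drops sharply up to three clusters and changes little afterwards, which is the kind of behavior used in the text to argue that the selected number of clusters is sufficient.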

We should also comment on the particular distributions in each cluster. There are a number of reasons for the variability inside each cluster, and it is therefore important to understand the possible statistical explanation for it, since this is connected to problems in the data acquisition but also to the RNA dynamics. We have experimented with a number of fitting functions, and we have observed that the best fit for the major clusters (with a significant improvement) is obtained using exponential distributions, and not Gaussian ones as argued for example in [5]. For example, for the first dataset, the kurtosis of the main cluster is 5.3 for α and 4.6 for ζ, clearly indicating a significant deviation from Gaussian distributions. The log-likelihood obtained when fitting an exponential function improves by 24% with respect to fitting a Gaussian for the α torsion angle and by 23% for the ζ torsion angle. Similar behavior is observed for the other dataset, although sometimes the improvement is a bit more moderate (e.g., for the first mode of α in the first dataset, the improvement is 16%). Understanding the distributions in each cluster is crucial for future steps of this research, namely probabilistic design.

Figure 4: The tail of ζ for the second dataset. Although two peaks can be “guessed,” the distribution is much flatter than, for example, that of the α torsion angle.

3.1  Vector  Quantization  and  Binning

The results described above address the scalar quantization of the torsion angles, and already lead to the fully automatic motif-finding technique reported in the next section. We can of course also perform vector quantization, which provides an additional automatic way to study the vector clusters without the need for visualization-based decisions such as those in [5, 9]. For example, if we request 6 centers for the pair (α, ζ), we obtain (167.6, 284.6), (291.4, 189.2), (69.1, 284.7), (294.4, 289.4), (105.1, 110.5), (287.4, 86.7).⁴

We note that the α component of the automatically detected centers is as in the case of scalar quantization, while the ζ component includes terms that appear both when we request 2 and when we request 3 bins for ζ in the scalar case. Performing this vectorial analysis, for 2 or more torsion angles together, gives us information on the importance of the distribution centers when the angles are considered as a whole. We could then use this, instead of the scalar work that we continue with below, as the basis for vectorial clustering.

⁴ These results are for residue-based parsing, while for suite-based parsing we obtain (167.6, 284.6), (287.5, 86.7), (294.4, 289.4), (105.3, 109.8), (291.4, 189.2), (69.2, 284.2). More details on these two types of parsing are provided below.
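The comparison between Gaussian and exponential fits reported for the scalar clusters earlier in this section can be sketched as follows. Two assumptions are made loudly here: “exponential” is read as a two-sided exponential (Laplace) density centered on the cluster, and both models are fitted by maximum likelihood. The sample is synthetic (evenly spaced Laplace quantiles around a hypothetical center), so the numbers it produces are not those of the paper.

```python
import math
from statistics import mean, median


def kurtosis(values):
    """Fourth standardized moment; it equals 3 for a Gaussian distribution."""
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m4 = sum((v - m) ** 4 for v in values) / len(values)
    return m4 / m2 ** 2


def gaussian_loglik(values):
    """Log-likelihood at the maximum-likelihood Gaussian fit (mean, variance)."""
    n, m = len(values), mean(values)
    var = sum((v - m) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def laplace_loglik(values):
    """Log-likelihood at the maximum-likelihood Laplace fit (median, mean abs. dev.)."""
    n, loc = len(values), median(values)
    scale = sum(abs(v - loc) for v in values) / n
    return -n * (math.log(2.0 * scale) + 1.0)


def laplace_quantiles(loc, scale, n):
    """Deterministic Laplace-shaped sample built from the inverse CDF."""
    sample = []
    for i in range(n):
        u = (i + 0.5) / n - 0.5  # lies strictly inside (-0.5, 0.5)
        sample.append(loc - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u)))
    return sample


# Hypothetical cluster: center 294 degrees, scale 8 degrees (illustrative values).
cluster = laplace_quantiles(294.0, 8.0, 1000)
k = kurtosis(cluster)
gain = laplace_loglik(cluster) - gaussian_loglik(cluster)
```

A kurtosis well above 3 together with a positive log-likelihood gain is the pattern that favors the exponential model over the Gaussian one for a given cluster.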


4  Automatically  Finding  Motifs 

With the above automatic procedure, we can proceed to find motifs. Basically, we cluster the torsion angles according to their proximity to the centers in Table 1. In the results reported below, we have not considered a “dead zone” (equivalent to the manually defined “other” bins in [5], and to some of the results from the filtering approach in [9]), and each torsion angle is assigned to one of the clusters. Following the filtering approach in [9] and the “other” bins in [5], we could be more conservative and only consider torsion angles that are within a certain distance of the cluster centers, while considering the rest as “noise.” This can of course also be done automatically, for example by requiring the angles to be within p times the variance inside the class. Therefore, the technique proposed here provides not only an automatic clustering approach, but also a way to filter out data if so desired.
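The assignment of each torsion angle to its nearest center, with an optional dead zone, can be sketched as below. The centers are those listed for the first dataset in Table 1, with the two close γ centers sharing label 1. Several choices are assumptions made for illustration: the digit order α, γ, δ, ζ in the motif string, the label 0 for a rejected angle, a fixed distance threshold in degrees in place of the variance-based rule mentioned in the text, and plain (non-circular) distances.

```python
# Cluster centers for the first dataset (Table 1); label -> list of centers.
CENTERS = {
    "alpha": {1: [68.3], 2: [169.7], 3: [294.3]},
    "gamma": {1: [50.4, 60.0], 2: [175.8], 3: [292.3]},
    "delta": {1: [81.7], 2: [147.8]},
    "zeta": {1: [286.7], 2: [118.0]},
}
ORDER = ("alpha", "gamma", "delta", "zeta")  # assumed digit order of a motif string


def label_angle(name, value, max_distance=None):
    """Label of the nearest center, or 0 (hypothetical 'noise' label) if too far."""
    best_label, best_distance = None, float("inf")
    for label, centers in CENTERS[name].items():
        for center in centers:
            distance = abs(value - center)  # assumption: no wrap-around at 360
            if distance < best_distance:
                best_label, best_distance = label, distance
    if max_distance is not None and best_distance > max_distance:
        return 0
    return best_label


def label_residue(angles, max_distance=None):
    """Four-digit motif string for one residue given a dict of its torsion angles."""
    return "".join(str(label_angle(name, angles[name], max_distance)) for name in ORDER)


# Hypothetical residue with angles close to the most populated cluster centers.
residue = {"alpha": 295.0, "gamma": 55.0, "delta": 80.0, "zeta": 290.0}
motif = label_residue(residue)
```

Without a threshold every angle receives a cluster label, as in the results reported here; passing `max_distance` turns the same routine into a simple filter.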

Using  the  notation  in  Table  1,  we  present  in  Table  3 
the  most  frequent  cells  for  the  residues  in  both  datasets 
(left  and  right  for  each  pair),  and  for  residue  and  suite 
parsing  (left  and  right  pairs).  Similar  results  were  reported 
in  [5]  for  the  first  dataset  and  for  residue  parsing  (that  is, 
corresponding  only  to  the  top-left  table),  where  the  cluster 
centers  and  boundaries  were  defined  manually. 

The next step is of course to look for motifs spanning more than one consecutive residue. In Table 2, we report the larger A-helices we automatically found in the first dataset (these are given by the composition 3111 in each residue; see [5]).

We also found 27 tetraloops (defined by the series 3111, 3111, 2111, 3111), starting at positions 149, 252, 313, 468, 505, 624, 690, 804, 1054, 1197, 1326, 1388, 1468, 1499, 1595, 1628, 1706, 1748, 1793, 1808, 1862, 1991, 2061, 2248, 2411, 2629, 2695; and four e-strands (3111, 3112, 2122, 3222, 3111) starting at locations 172, 210, 1367, 2689.
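Once every residue carries a four-digit label, multi-residue motifs reduce to searches over the label sequence. The sketch below finds maximal runs of the helical label and occurrences of a fixed pattern such as the tetraloop series given above. The function names, the 1-based positions, and the minimum run length of 10 (inferred from the shortest entries of Table 2) are assumptions, and the label sequence is a toy example.

```python
def helix_runs(labels, helix="3111", min_length=10):
    """(start, length) of maximal runs of `helix`; start positions are 1-based."""
    runs, start = [], None
    for i, label in enumerate(labels + [None]):  # the sentinel closes a final run
        if label == helix:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                runs.append((start + 1, i - start))
            start = None
    return runs


def find_pattern(labels, pattern):
    """1-based start positions at which `pattern` occurs as consecutive labels."""
    k = len(pattern)
    return [i + 1 for i in range(len(labels) - k + 1) if labels[i:i + k] == pattern]


TETRALOOP = ["3111", "3111", "2111", "3111"]  # series quoted in the text

# Toy label sequence: a 12-residue helix, a break, then one tetraloop pattern.
labels = ["3111"] * 12 + ["2211"] + ["3111", "3111", "2111", "3111"] + ["1212"]
helices = helix_runs(labels)
tetraloops = find_pattern(labels, TETRALOOP)
```

Real chains also contain breaks and residues with unknown angles, which a production version would need to handle before searching.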

5  Residue  vs.  Suite  Parsing 

RNA can be parsed by residues or by suites as in [9]; see Figure 1. The motivation for the latter is the high correlation between the adjacent phosphate torsion angles ζ and α. This correlation was established for dinucleotides and short oligonucleotides [15]. Here we extend the relation to any RNA molecule using information theory.

Starting residue   Length
12                 12
98                 10
294                10
343                13
399                10
418                10
519                13
589                14
606                13
747                12
796                10
1014               14
1139               10
1217               12
1261               16
1291               20
1317               11
1329               11
1453               17
1507               17
1535               24
1606               10
1760               11
1843               12
1896               23
1920               21
2259               12
2429               13
2542               10
2621               10
2708               10

Table 2: Location and length of larger A-helices automatically found in the first dataset.

To further understand the differences between the two forms of parsing the RNA backbone, we computed the mutual information between α and ζ, both for residue parsing (α(i) against ζ(i)) and for suite parsing (α(i) against ζ(i - 1)). Mutual information is defined as follows [1]. Let x and y be two random variables. First, the entropy of x is defined as $H(x) := -E_x[\log P(x)]$, where $E_x[\cdot]$ stands for the expectation. Entropy measures (in bits) the randomness of a signal: the larger the entropy, the more random the variable is. The joint entropy is defined as $H(x, y) := -E_x[E_y[\log P(x, y)]]$, and summarizes the degree of dependence of x on y, while the conditional entropy is given by $H(y|x) := -E_x[E_y[\log P(y|x)]]$, which summarizes the randomness of y given knowledge of x. We can now define the mutual information,

$$MI(x, y) := H(y) - H(y|x) = H(x) + H(y) - H(x, y),$$

which is a measure of the reduction of the entropy (randomness) of y given x.

In the case of residue parsing, we obtained MI(α, ζ) = 0.83, while for suite parsing we obtained MI(α, ζ) = 1.16.⁵ This increase in mutual information indicates that suite parsing is more appropriate (as claimed in [9]), or at least that these torsion angles are functionally more dependent under this parsing.⁶ We should add, for completeness, that MI(α, γ) = 0.82 (H(γ) = 3.56), MI(α, δ) = 0.46 (H(δ) = 2.74), and MI(γ, δ) = 0.38.
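The mutual information comparison can be sketched with plug-in (histogram) entropy estimates, using the identity MI(x, y) = H(x) + H(y) - H(x, y). The 100 bins follow the footnote on how the angles were quantized; everything else is an assumption for illustration: the function names, the uniform bins over 0 to 360 degrees, and the synthetic sequence in which α(i) is constructed to depend only on ζ(i - 1).

```python
import math
import random
from collections import Counter


def entropy_bits(symbols):
    """Plug-in entropy estimate (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def to_bin(angle, n_bins):
    """Index of the uniform bin over [0, 360) that contains the angle."""
    return min(int((angle % 360.0) / 360.0 * n_bins), n_bins - 1)


def mutual_information(x, y, n_bins=100):
    """MI(x, y) = H(x) + H(y) - H(x, y), estimated from binned angle pairs."""
    bx = [to_bin(a, n_bins) for a in x]
    by = [to_bin(a, n_bins) for a in y]
    return entropy_bits(bx) + entropy_bits(by) - entropy_bits(list(zip(bx, by)))


# Synthetic chain (not measured data): zeta is drawn at random and alpha(i)
# is a fixed function of zeta(i - 1), mimicking a strong suite-level dependence.
rng = random.Random(0)
zeta = [rng.choice([20.0, 110.0, 200.0, 290.0]) for _ in range(2000)]
alpha = [0.0] + [(z + 45.0) % 360.0 for z in zeta[:-1]]

mi_residue = mutual_information(alpha, zeta)        # alpha(i) against zeta(i)
mi_suite = mutual_information(alpha[1:], zeta[:-1])  # alpha(i) against zeta(i - 1)
```

In this constructed example the suite pairing recovers nearly all of the entropy of ζ while the residue pairing recovers almost none. Plug-in estimates are biased upward when there are many bins and few samples, so testing several bin counts, as done in the paper, is a sensible check.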


6  Principal  Component  Analysis  of 
Tetraloops 


As done for secondary structures in protein research, e.g., [3], it is important to study the variability of the motifs found in RNA, due once again to its possible implications for the dynamics. Following the work on proteins [3], we perform principal component analysis (PCA) on the 27 tetraloops reported above and on an additional, larger data set.

The basic procedure is as follows. Let L denote the number of residues in the motif (L = 4 for tetraloops) and N the number of samples (27 for our first example). The first step in the PCA is to compute the covariance matrix C, which is a square matrix of dimension 4L (four angles per residue), whose elements are given by

$$C_{ij} = \frac{1}{N-1} \sum_{m=1}^{N} \left( x_i^{(m)} - \langle x_i \rangle \right) \left( x_j^{(m)} - \langle x_j \rangle \right),$$

where $x_i^{(m)}$ is the i-th coordinate of the m-th sample and $\langle x_i \rangle$ is the i-th coordinate of the mean structure. We then compute the eigenvalues and eigenvectors of this matrix, $\lambda_q$ and $v_q$. The eigenvalue distribution will tell us the number of modes in this class. In Figure 5, top, we clearly see 2 to 3 dominant eigenvalues for this data set, considering the 4 angles (α, γ, δ, ζ). In the middle, we repeat the computation for a total of 261 tetraloops,⁷ considering now all six torsion angles (α, β, γ, δ, ε, ζ), and defining a tetraloop as the combination (3?11?1, 3?11?1, 2?11?1, 3?11?1), where the symbol ? stands for “don’t care” for those angles. We observe again the 2 (maximum 3) dominant eigenvalues (analysis of the eigenvectors will be reported elsewhere). When using the same data set, again with all six torsion angles, but defining a tetraloop as (3?11?1, 2?11?1, 3?11?1, 3?11?1), we obtain 168 examples. The eigenvalue distribution is shown in the bottom plot of Figure 5, with two dominant eigenvalues once again, even stronger than before.⁸ Note that the first and second histograms of Figure 5 refer to “tetraloops” in the sense just defined, while the third histogram refers to “tetraloops” in the standard sense [7, 18].

We have used simple (and linear) analysis in this case, although there is no reason to believe that the space of RNA motifs is flat. We plan to investigate the use of tools that consider the geometry of the space of motifs, e.g., [17], for which orders of magnitude more data will be needed.

⁵ Both α and ζ have H = 4.59.

⁶ For computing the MI, we quantized the α and ζ torsion angles into 100 bins. We also tested different numbers of bins, and the mutual information always increased for suite parsing.

⁷ rr0011, rr0033, rr0055, rr0043, rr0044, rr0060, rr0061, rr0077, rr0078, and rr0079; HLSU 50 from NDB.


Figure 5: Frequency plots of eigenvalues corresponding to the tetraloops PCA analysis. The first two plots use tetraloops in the sense defined in this paper, while the third uses them in the standard sense.
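The PCA step of Section 6 (a sample covariance matrix with the 1/(N - 1) normalization, followed by its leading eigenvalues) can be sketched in plain Python. This is an illustration under assumptions: the eigenvalues are obtained by power iteration with deflation, chosen only to keep the example dependency-free, the function names are hypothetical, and the three-coordinate samples are toy data rather than the 4L-dimensional torsion-angle vectors of actual tetraloops.

```python
import math


def covariance_matrix(samples):
    """Sample covariance C_ij = 1/(N-1) * sum_m (x_i - <x_i>)(x_j - <x_j>)."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    centered = [[s[i] - mean[i] for i in range(dim)] for s in samples]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def dominant_eigenvalues(matrix, count, iterations=500):
    """Largest `count` eigenvalues of a symmetric positive semi-definite matrix."""
    dim = len(matrix)
    a = [row[:] for row in matrix]
    eigenvalues = []
    for k in range(count):
        v = [1.0 + 0.1 * ((i + k) % 3) for i in range(dim)]  # arbitrary start vector
        for _ in range(iterations):
            w = [sum(a[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the converged direction gives the eigenvalue.
        w = [sum(a[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        vv = sum(x * x for x in v)
        lam = sum(v[i] * w[i] for i in range(dim)) / vv
        eigenvalues.append(lam)
        # Deflation: remove the component just found before the next search.
        for i in range(dim):
            for j in range(dim):
                a[i][j] -= lam * v[i] * v[j] / vv
    return eigenvalues


# Toy samples with one strong, one weak, and one negligible direction of variation.
samples = [(3.0, 0.0, 0.0), (-3.0, 0.0, 0.0), (0.0, 1.0, 0.0),
           (0.0, -1.0, 0.0), (0.0, 0.0, 0.1), (0.0, 0.0, -0.1)]
cov = covariance_matrix(samples)
eig = dominant_eigenvalues(cov, 3)
total = sum(cov[i][i] for i in range(len(cov)))
explained = [value / total for value in eig]  # fraction of variance per mode
```

In practice a symmetric eigen-solver from a numerical library would replace the power iteration. The quantity of interest is the same: how many eigenvalues carry most of the variance.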


7  Concluding  Remarks 

In this paper we have seen how classical techniques from statistical signal processing are useful for the analysis of RNA structure. These techniques can be augmented with novel clustering approaches being developed by the learning and signal processing communities, and investigating those, together with the search for new motifs, is the subject of our current efforts.


⁸ The stability of these motifs, and the comparison between residue and suite parsing, are the subject of current studies.
Residue parsing                      Suite parsing
Dataset 1        Dataset 2           Dataset 1        Dataset 2
Motif   Freq.    Motif   Freq.       Motif   Freq.    Motif   Freq.
3111    1835     3111    6946        3111    1812     3111    6702
3121    136      2211    630         2211    125      2211    593
2211    125      3121    375         3122    114      3112    337
3112    92       3112    298         3112    111      3122    294
2111    52       2111    206         2111    86       2111    294
2112    42       2112    148         3121    58       3121    187
1212    40       1212    144         1111    47       1211    182
3122    37       3211    123         1211    42       1111    161
2122    36       1112    120         2122    39       3211    111
1122    36       3122    119         1121    38       1311    91
1111    35       1111    104         3211    30       1121    77
3211    31       1122    91          1322    23       2212    74
1112    31       1311    84          2121    21       2122    70
2121    24       1211    76          1311    20       1122    70
1121    22       2121    71          1122    20       2121    58
1321    19       2212    68          1112    19       2112    54
1322    15       2122    64          3222    13       1112    53
1311    14       1121    58          3311    13       3311    41
3312    13       2221    43          2222    12       3222    40
2221    12       1321    38          1321    11       3212    40
3321    12       3221    34          3321    10       1322    39
3222    11       3312    34          3212    10       2222    38
1222    10       3212    32          1221    9        1212    37
3212    9        3222    28          2112    7        1221    27
3221    8        1322    27          3221    6        1321    24
2212    8        1222    26          3322    6        3321    23
1211    7        3321    26
1312    7        1312    23

Table 3: Frequency of the most popular torsion angle motifs, both for residue parsing (first two columns) and suite parsing (last two columns). The table on the left of each pair corresponds to the first dataset, while the one on the right corresponds to the second dataset. Note that the angles of the first two columns correspond to the same residue, while those of the last two columns correspond to suites; see Figure 1.